Rutuja Shivraj Pawar (220051, rutuja.pawar@ovgu.de)
Nadiia Honcharenko (220681, nadiia.honcharenko@st.ovgu.de)
Shivani Jadhav (223856, shivani.jadhav@st.ovgu.de)
Sumit Kundu (217453, sumit.kundu@st.ovgu.de)
M.Sc. Uli Niemann
The customer is a central factor in the success of any business. Conventional wisdom tells us that the cost of retaining an existing customer is far less than that of acquiring a new one. For a business to grow sustainably, retaining its existing customer base while expanding its new customer base is critical, which demands that the business understand how its customers behave in relation to it. Obtaining a 360° view of its customers is therefore crucial for a business seeking a competitive edge in the market. In such a scenario, Customer Behavioural Analytics plays an important role in leveraging data analytics to find meaningful behavioural patterns in customer-specific business data.
Consequently, this project aims to understand consumer behaviour in the retail sector. Decoding consumer behaviour is based on understanding how consumers make purchase decisions and what factors influence those decisions. The project also aims to discover dependencies between customers, products and shops, to surface further insights about their behaviour. These insights can help a business implement strategies that increase revenue through customer satisfaction.
This project addresses the problem of understanding the behaviour of customers of Coop, an Italian retail distribution company, in a single Italian city. The project intends to discover different analytical insights about the purchase behaviour of the customers by answering the formulated Research Questions (RQs).
Supermarket aggr.Customer
The dataset used is the retail market data of Coop, one of the largest Italian retail distribution companies, for a single Italian city.
The Supermarket aggr.Customer dataset used for the analysis contains data aggregated from customers and shops (Pennacchioli et al. 2013), pivoted to new columns. The dataset thus contains 40 features with 60,366 instances and is approximately 14.0 MB in size.
Below are the RQs which were formulated at the initial stages of the project, based on a primary understanding of the data but without a detailed Exploratory Data Analysis:
1. Are customers willing to travel long distances to purchase products?
Relevance: This will help to understand whether price is an important factor affecting the majority of customers' purchase decisions.
2. Which are the products for which customers are ready to travel long distances and for which products they select the closest shop?
Relevance: This will help to understand the nature of the products in the context of proximity. It is assumed that customers select the closest shop for daily products such as groceries, but may travel long distances for one-time purchases such as kitchen equipment. As Data Science is results-driven rather than based solely on intuition, this question helps to verify that assumption.
3. What is the likelihood that a customer selects a particular shop to purchase a particular product?
Relevance: This will help to understand which shops in the retail chain are in demand for a particular product, which can further facilitate better stock management to meet customer demand.
4. What is the ranking of the shops in terms of attracting customers and thus generating revenue, and what is each shop's individual customer base?
Relevance: This will help to identify the most popular shops in the retail chain and to target different shop-level marketing schemes at the appropriate customers through customer segmentation.
5. Which customers are the most profitable in terms of revenue generation?
Relevance: This will help to identify customers with potentially high Customer Lifetime Value and to target appropriate loyalty programs at them, turning satisfied loyal customers into advocates for the business.
Data Wrangling consists of different steps transforming data from its raw form into a clean form that is appropriate and accurate for data analysis. Below are the steps carried out as part of Data Wrangling:
Visualize the input dataset
library(tidyverse)
library(DataExplorer)
# Read data from the input csv file
filepath<- "Input Dataset/Supermarket aggr.Customer.csv"
supermarket_data <- read_csv(filepath)
# Convert the data to a tibble, as tibbles are easier to examine than plain data frames
# (tbl_df() is deprecated; as_tibble() is its current equivalent)
supermarket_tbl <- as_tibble(supermarket_data)
# Check the dimension of the input dataset and the variables through a plot
plot_str(supermarket_tbl)
Generate the input dataset statistics
# Data Statistics
gather(introduce(supermarket_tbl))
Generate the input dataset summary
# Data Summary, p0 = min value, p100 = max value
library(skimr)
skim_with(integer = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim_with(numeric = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim(supermarket_tbl) %>% skimr::kable()
## Skim summary statistics
## n obs: 60366
## n variables: 40
##
## Variable type: integer
##
## variable missing complete n mean sd p0 p100
## ---------------------------------- --------- ---------- ------- --------- ---------- ---- -------
## customer_id 0 60366 60366 30183.5 17426.31 1 60366
## products_purchased 1 60365 60366 1778.71 2185.05 1 22131
## products_purchased_shop_1 0 60366 60366 887.81 1438.48 0 17016
## products_purchased_shop_2 0 60366 60366 605.37 1382.14 0 22110
## products_purchased_shop_3 0 60366 60366 156.68 723.82 0 16913
## products_purchased_shop_4 0 60366 60366 56.65 455.1 0 17445
## products_purchased_shop_5 0 60366 60366 72.18 504.96 0 20891
## shops_used 1 60365 60366 2.38 1.01 1 5
## unique_products_purchased 1 60365 60366 330.67 236 1 1465
## unique_products_purchased_shop_1 0 60366 60366 222.3 213.98 0 1459
## unique_products_purchased_shop_2 0 60366 60366 126.61 172.33 0 1161
## unique_products_purchased_shop_3 0 60366 60366 31.91 88.53 0 1048
## unique_products_purchased_shop_4 0 60366 60366 11.57 52.26 0 846
## unique_products_purchased_shop_5 0 60366 60366 15.77 61.93 0 841
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p100
## ------------------------- --------- ---------- ------- --------- --------- ------- ----------
## amount_purchased 1 60365 60366 4235.49 5006.78 0.21 51588.66
## amount_purchased_shop_1 0 60366 60366 2310.43 3509.99 0 47839.38
## amount_purchased_shop_2 0 60366 60366 1373.29 3104.61 0 44033.96
## amount_purchased_shop_3 0 60366 60366 303.26 1391.14 0 31246.35
## amount_purchased_shop_4 0 60366 60366 110.31 880.51 0 29628.03
## amount_purchased_shop_5 0 60366 60366 138.13 960.49 0 36819.42
## avg_distance_to_shops 1 60365 60366 2030.23 1119.91 6.64 9004.16
## avg_price 1 60365 60366 3.67 9.13 0.21 787.57
## avg_price_shop_1 0 60366 60366 4.68 15.96 0 787.57
## avg_price_shop_2 0 60366 60366 2.3 9.97 0 787.57
## avg_price_shop_3 0 60366 60366 0.76 1.9 0 263.23
## avg_price_shop_4 0 60366 60366 0.44 4.39 0 522.61
## avg_price_shop_5 0 60366 60366 0.45 1.33 0 133.97
## avg_purchase 1 60365 60366 8.53 10.32 0.21 787.57
## avg_purchase_shop_1 0 60366 60366 8.1 16.54 0 787.57
## avg_purchase_shop_2 0 60366 60366 4.97 11.23 0 787.57
## avg_purchase_shop_3 0 60366 60366 1.52 3.69 0 263.23
## avg_purchase_shop_4 0 60366 60366 0.76 4.9 0 522.61
## avg_purchase_shop_5 0 60366 60366 0.81 2.6 0 133.97
## distance_shop_1 0 60366 60366 2496.63 1281.53 93.28 8019.92
## distance_shop_2 0 60366 60366 2488.24 1417.36 11.19 9004.16
## distance_shop_3 0 60366 60366 1924.97 1157.75 17.84 7395.25
## distance_shop_4 0 60366 60366 2882.8 1730.09 6.64 9273.69
## distance_shop_5 0 60366 60366 2020.9 1260.85 25.46 7465.81
## max_distance_to_shops 1 60365 60366 2942.67 1327.53 6.64 9267.7
## min_distance_to_shops 1 60365 60366 1396.44 1048.42 6.64 9004.16
Eliminate the missing values in the input dataset
# Eliminate the missing values in the dataset
supermarket_tbl_Clean1<-na.omit(supermarket_tbl)
na.action(supermarket_tbl_Clean1)
## 44617
## 44617
## attr(,"class")
## [1] "omit"
# Percentage of data missing (Should be 0%)
missing_data_count = sum(!complete.cases(supermarket_tbl_Clean1))
total_data = dim(supermarket_tbl_Clean1)[1]
missing_data_percent = (missing_data_count/total_data) * 100
missing_data_percent
## [1] 0
Eliminate the duplicate rows in the input dataset
# Eliminate duplicate rows (assign the result back; distinct() does not modify in place)
supermarket_tbl_Clean1 <- distinct(supermarket_tbl_Clean1)
Round the decimal values in the input dataset
# Round the numeric columns to 4 decimal places
is.num <- sapply(supermarket_tbl_Clean1, is.numeric)
supermarket_tbl_Clean1[is.num] <- lapply(supermarket_tbl_Clean1[is.num], round, 4)
Rename the column names in the input dataset
# Rename column names
## From products_purchased to products_purchased_total
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'products_purchased'] <- 'products_purchased_total'
## From shops_used to shops_used_total
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'shops_used'] <- 'shops_used_total'
## From amount_purchased to amount_purchased_total
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'amount_purchased'] <- 'amount_purchased_total'
## From min_distance_to_shops to min_dist_to_custSel_shops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'min_distance_to_shops'] <- 'min_dist_to_custSel_shops'
## From max_distance_to_shops to max_dist_to_custSel_shops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'max_distance_to_shops'] <- 'max_dist_to_custSel_shops'
## From unique_products_purchased to unique_products_purchased_total_exclCommon
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'unique_products_purchased'] <- 'unique_products_purchased_total_exclCommon'
## From avg_distance_to_shops to avg_distance_to_all_shops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_distance_to_shops'] <- 'avg_distance_to_all_shops'
## From avg_price_shop_1 to avg_product_price_shop_1
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_1'] <- 'avg_product_price_shop_1'
## From avg_price_shop_2 to avg_product_price_shop_2
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_2'] <- 'avg_product_price_shop_2'
## From avg_price_shop_3 to avg_product_price_shop_3
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_3'] <- 'avg_product_price_shop_3'
## From avg_price_shop_4 to avg_product_price_shop_4
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_4'] <- 'avg_product_price_shop_4'
## From avg_price_shop_5 to avg_product_price_shop_5
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_5'] <- 'avg_product_price_shop_5'
## From avg_price to avg_purchased_product_price_allShops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price'] <- 'avg_purchased_product_price_allShops'
## From avg_purchase_shop_1 to avg_purchase_amount_shop_1
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_1'] <- 'avg_purchase_amount_shop_1'
## From avg_purchase_shop_2 to avg_purchase_amount_shop_2
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_2'] <- 'avg_purchase_amount_shop_2'
## From avg_purchase_shop_3 to avg_purchase_amount_shop_3
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_3'] <- 'avg_purchase_amount_shop_3'
## From avg_purchase_shop_4 to avg_purchase_amount_shop_4
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_4'] <- 'avg_purchase_amount_shop_4'
## From avg_purchase_shop_5 to avg_purchase_amount_shop_5
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_5'] <- 'avg_purchase_amount_shop_5'
## From avg_purchase to avg_purchase_amount_allShops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase'] <- 'avg_purchase_amount_allShops'
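As an aside, the repeated `names()<-` assignments above can be collapsed into a single lookup-based rename. A minimal base-R sketch on a toy data frame (the lookup shows only an illustrative subset of the renames):

```r
# Lookup vector: old name -> new name (illustrative subset of the renames above)
rename_map <- c(
  products_purchased = "products_purchased_total",
  shops_used         = "shops_used_total",
  amount_purchased   = "amount_purchased_total"
)

# Toy data frame standing in for supermarket_tbl_Clean1
df <- data.frame(products_purchased = 1:2, shops_used = 3:4, amount_purchased = 5:6)

# Replace any name found in the lookup, keep the rest unchanged
hit <- names(df) %in% names(rename_map)
names(df)[hit] <- rename_map[names(df)[hit]]

names(df)
```

Columns absent from the lookup keep their original names, so the same snippet is safe to re-run on already-renamed data.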
Reorder the columns in the input dataset
# Reorder Columns
supermarket_tbl_Clean1 <- supermarket_tbl_Clean1[c(1,10,11,12,13,14,15,3,4,2,16,17,18,19,20,5,21,22,23,24,25,6,36,37,38,39,40,9,26,27,28,29,30,7,31,32,33,34,35,8)]
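Positional indices such as the vector above are fragile if upstream columns are added or renamed; reordering by explicit names is more robust. A minimal base-R sketch on a toy data frame (the column names are illustrative, not the dataset's):

```r
# Toy frame; the real call indexes supermarket_tbl_Clean1 by position
df <- data.frame(b = 1, c = 2, a = 3)

# Reordering by name survives upstream column additions or renames
desired_order <- c("a", "b", "c")
df <- df[, desired_order]

names(df)
```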
Write the cleaned dataset tbl to a CSV file
# Write the cleaned data tbl to csv
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
write.csv(supermarket_tbl_Clean1, file = clean_filepath, row.names = FALSE)
Visualize the cleaned dataset
# Check the dimension of the cleaned dataset and the variables
plot_str(supermarket_tbl_Clean1)
Generate the cleaned dataset statistics
# Cleaned data Statistics
gather(introduce(supermarket_tbl_Clean1))
Analyze the Continuous Variables in the cleaned dataset as a Histogram
# Analyze Continuous Variables in the cleaned dataset (Univariate Analysis)
plot_histogram(supermarket_tbl_Clean1)
Interpretation: The plotted histograms depict the distribution of each continuous variable in the dataset. These plots can be used to understand the data spread, to see whether the data is symmetric or skewed, and to graphically summarize the univariate distributions.
Examine the correlated features in the cleaned dataset through Correlation Analysis
# Correlation analysis (Multivariate Analysis, On Continuous features only) to examine corelated features in the cleaned dataset
library(ggcorrplot)
corr <- round(cor(supermarket_tbl_Clean1), 1)
ggcorrplot(corr, outline.col = "white") +geom_tile(height=1.8, width=1.8) +
scale_fill_gradient2(low="blue", mid="white", high="red") +
theme_minimal() +
coord_equal() +
labs(x="",y="",fill="Correlation coefficient") +
theme(axis.text.x=element_text(size=7, angle=90, vjust=1, hjust=1,
margin=margin(-3,0,0,0)),
axis.text.y=element_text(size=7, margin=margin(0,-3,0,0)),
panel.grid.major=element_blank())
Interpretation: The correlation plot depicts the association between the variables in the dataset; the degree of association is indicated by the colour of the correlation coefficient. The plot helps to understand the relationships between the different variables in the dataset.
Visualize the deviation from a specific probability distribution in the cleaned dataset through Quantile-Quantile plot
# Quantile-Quantile plot to visualize the deviation from a specific probability distribution in the cleaned dataset
plot_qq(supermarket_tbl_Clean1)
Interpretation: The Quantile-Quantile plot compares two probability distributions through plotting their quantiles against each other. This plot depicts if both sets of quantiles are from the same distribution (points form a roughly straight line), helping to further understand the data distribution.
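The straight-line check described above can also be sketched numerically in base R; the data below are synthetic, generated only to illustrate the idea:

```r
set.seed(1)

# A sample from a normal distribution: its normal Q-Q points should be nearly linear
x <- rnorm(500)
qq <- qqnorm(x, plot.it = FALSE)

# A correlation close to 1 between theoretical and sample quantiles
# indicates the points lie on a roughly straight line
cor(qq$x, qq$y)
```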
Data Preparation for RQ1
The unimportant and redundant features that provide no meaningful information in this context are first removed from the dataset. The dataset also has no ground truth associated with it; one was required for RQ1 and was therefore created based on certain assumptions. It is assumed that the shop a customer visits the most is their most preferred shop. New columns were generated to find the closest shop (a), the shop from which most products are purchased (b), the shop from which most unique products are purchased (c), the shop offering the lowest average product price (d) and the shop at which the customer spends the most money (e). The most frequent shop among (b), (c), (d) and (e) is assigned as the most preferred shop (f). The customers were then categorized by comparing (a) with (f): if the values match, class 'no' is assigned, based on the assumption that customers who choose the closest shop as their most preferred shop do not like to travel long distances; class 'yes' is assigned otherwise. Finally, the dataset is re-arranged and stored along with the newly created class label.
library(tidyverse)
library(dplyr)
## Set file path
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
## Read data from a file
supermarket_data_clean <- read_csv(file_path)
## List of all features present in the data frame
all_features <- colnames(supermarket_data_clean)
## List of selected features
sel_features <- all_features[-c(1,2,8,9,10,16,22,28,34,35,36,37,38,39,40)]
## Create a data frame with only the selected features
supermarket_data_model <- supermarket_data_clean %>% select(all_of(sel_features))
## Generate new columns to find the closest shop, the shop from which most products are purchased, the shop from which most unique products are purchased, the shop offering the lowest average product price and the shop at which the customer spends the most money.
supermarket_data_model$min_dist <- str_sub(colnames(supermarket_data_model[,1:5]),-1,-1)[apply(supermarket_data_model[,1:5],1,which.min)]
supermarket_data_model$most_prod_purch_from <- str_sub(names(supermarket_data_model[,6:10]),-1,-1)[max.col(supermarket_data_model[,6:10], "last")]
supermarket_data_model$most_uni_prod_purch_from <- str_sub(names(supermarket_data_model[,11:15]),-1,-1)[max.col(supermarket_data_model[,11:15], "last")]
supermarket_data_model$least_avg_prod_pri <- str_sub(colnames(supermarket_data_model[,16:20]),-1,-1)[apply(supermarket_data_model[,16:20],1,which.min)]
supermarket_data_model$max_amt_purch <- str_sub(names(supermarket_data_model[,21:25]),-1,-1)[max.col(supermarket_data_model[,21:25], "last")]
## Create a data frame having only the newly generated columns
test <- supermarket_data_model[,26:30]
## Generate new columns - the most preferred shop and the class to which the customer belongs
for (row in 1:nrow(test)){
## Create a vector for each row
vec <- c(test[row, "most_prod_purch_from"], test[row, "most_uni_prod_purch_from"], test[row, "least_avg_prod_pri"], test[row, "max_amt_purch"])
## Sort and find the most preferred shop
supermarket_data_model[row, "most_pref"] <- names(sort(summary(as.factor(unlist(vec))), decreasing = TRUE)[1])
## Assign labels to customers (0 or 'no' - not willing to travel far for shopping; 1 or 'yes' - willing to travel far for shopping)
if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref"]){
supermarket_data_model[row, "class"] <- 'No'
} else{
supermarket_data_model[row, "class"] <- 'Yes'
}
}
## Re-order columns
supermarket_data_class <- supermarket_data_model[c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,32)]
## Set file path
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_Data_Classification.csv"
## Write the dataframe to csv file
write.csv(supermarket_data_class, file = clean_filepath, row.names = FALSE)
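As an alternative to the row-wise loop above, the majority vote can be expressed with `apply()` and `table()`. A base-R sketch on toy data (the column names mirror the generated columns; tie-breaking may differ slightly from the `sort(summary(...))` approach):

```r
# Toy stand-in for the four vote columns plus the closest shop per customer
votes <- data.frame(
  most_prod_purch_from     = c("1", "2"),
  most_uni_prod_purch_from = c("1", "2"),
  least_avg_prod_pri       = c("3", "2"),
  max_amt_purch            = c("1", "4"),
  stringsAsFactors = FALSE
)
min_dist <- c("1", "3")

# Majority vote per row: table() counts shop ids, names(which.max()) picks the winner
most_pref <- unname(apply(votes, 1, function(v) names(which.max(table(v)))))

# Label: "No" if the preferred shop is also the closest one, "Yes" otherwise
class_label <- ifelse(min_dist == most_pref, "No", "Yes")
```

Avoiding the per-row loop keeps the logic in one place and is noticeably faster on 60,366 rows.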
Data Preparation for RQ3
The unimportant and redundant features that provide no meaningful information in this context are first removed from the dataset. The most preferred shop for each customer, derived from the same assumption used for RQ1, is generated and used as the class label. New columns are generated to find the average distance travelled (a), the average number of products purchased (b), the average number of unique products purchased (c), the average product price (d) and the average amount spent (e) for each customer. A dataset using these newly created columns (a)-(e) is then generated for building a model that predicts the shop a customer is most likely to select as their most preferred shop.
library(tidyverse)
library(dplyr)
## Set file path
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
## Read data from a file
supermarket_data_clean <- read_csv(file_path)
## Generate columns to find the average distance travelled, average number of products purchased, average number of unique products purchased, average product price
## and average amount spent by each customer
supermarket_data_clean$distance_avg <- with(supermarket_data_clean, avg_distance_to_all_shops)
supermarket_data_clean$products_purchased_avg <- with(supermarket_data_clean, round(products_purchased_total/shops_used_total))
supermarket_data_clean$unique_products_purchased_avg <- with(supermarket_data_clean, round(unique_products_purchased_total_exclCommon/shops_used_total))
supermarket_data_clean$product_price_avg<- with(supermarket_data_clean, avg_purchased_product_price_allShops)
supermarket_data_clean$amount_purchased_avg<- with(supermarket_data_clean, avg_purchase_amount_allShops)
## Generate new columns to find the shop from which most products are purchased, the shop from which most unique products are purchased,
## the shop offering the lowest average product price and the shop at which the customer spends the most money.
supermarket_data_clean$most_prod_purch_from <- str_sub(names(supermarket_data_clean[,11:15]),-6,-1)[max.col(supermarket_data_clean[,11:15], "random")]
supermarket_data_clean$most_uni_prod_purch_from <- str_sub(names(supermarket_data_clean[,17:21]),-6,-1)[max.col(supermarket_data_clean[,17:21], "random")]
supermarket_data_clean$least_avg_prod_pri <- str_sub(colnames(supermarket_data_clean[,23:27]),-6,-1)[apply(supermarket_data_clean[,23:27],1,which.min)]
supermarket_data_clean$max_amt_purch <- str_sub(names(supermarket_data_clean[,29:33]),-6,-1)[max.col(supermarket_data_clean[,29:33], "random")]
## Create a data frame having only the newly generated columns
gen_df <- supermarket_data_clean[,46:49]
## Generate a new column - the most preferred shop for each customer
for (row in 1:nrow(gen_df)){
## Create a vector for each row
vec <- c(gen_df[row, "most_prod_purch_from"], gen_df[row, "most_uni_prod_purch_from"], gen_df[row, "max_amt_purch"])
## Sort and find the most preferred shop
supermarket_data_clean[row, "most_pref_shop"] <- names(sort(summary(as.factor(unlist(vec))), decreasing = TRUE)[1])
}
## Re-order columns
supermarket_data_predict <- supermarket_data_clean[c(41,42,43,44,45,50)]
## Set file path
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_Data_Prediction.csv"
## Write the dataframe to csv file
write.csv(supermarket_data_predict, file = clean_filepath, row.names = FALSE)
Exploratory Data Analysis (EDA) is the process of visualizing the main characteristics of the data before formal modelling, in order to discover data patterns and verify the initial assumptions made about the data. Below are the visualizations from the EDA carried out:
1) Visualize the highest revenue generating shops, shops selling the highest number of products, shops selling the highest number of unique products and the relation between them
library(ggplot2)
library(RColorBrewer)
library(scales)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
supermarket_data_clean <- read_csv(file_path)
Shop<- c(1,2,3,4,5)
# Revenue generation by Shops 1-5
# columns selected: amount_purchased_shop_1, 2, 3, 4, 5
slice1<-select(supermarket_data_clean, 29,30,31,32,33)
amountS1<-sum(slice1$amount_purchased_shop_1)
amountS2<-sum(slice1$amount_purchased_shop_2)
amountS3<-sum(slice1$amount_purchased_shop_3)
amountS4<-sum(slice1$amount_purchased_shop_4)
amountS5<-sum(slice1$amount_purchased_shop_5)
# create data frame for Revenue generation
Revenue_Generated<- c(amountS1,amountS2,amountS3,amountS4,amountS5)
Revenue<- data.frame(Shop, Revenue_Generated)
rownames(Revenue) <- NULL
# Products Sold by Shops 1-5
# columns selected: products_purchased_shop_1, 2, 3, 4, 5
slice2<-select(supermarket_data_clean, 11,12,13,14,15)
productsS1<-sum(slice2$products_purchased_shop_1)
productsS2<-sum(slice2$products_purchased_shop_2)
productsS3<-sum(slice2$products_purchased_shop_3)
productsS4<-sum(slice2$products_purchased_shop_4)
productsS5<-sum(slice2$products_purchased_shop_5)
# create data frame for Products Sold
Products_Purchased<- c(productsS1,productsS2,productsS3,productsS4,productsS5)
ProductsSold<- data.frame(Shop, Products_Purchased)
rownames(ProductsSold) <- NULL
# Unique products Sold by Shops 1-5
# columns selected: unique_products_purchased_shop_1,2,3,4,5
slice3<-select(supermarket_data_clean, 17,18,19,20,21)
uproductsS1<-sum(slice3$unique_products_purchased_shop_1)
uproductsS2<-sum(slice3$unique_products_purchased_shop_2)
uproductsS3<-sum(slice3$unique_products_purchased_shop_3)
uproductsS4<-sum(slice3$unique_products_purchased_shop_4)
uproductsS5<-sum(slice3$unique_products_purchased_shop_5)
# create data frame for Unique products Sold
UProducts_Purchased<- c(uproductsS1,uproductsS2,uproductsS3,uproductsS4,uproductsS5)
UProductsSold<- data.frame(Shop, UProducts_Purchased)
rownames(UProductsSold) <- NULL
# Plot a Bar graph to depict the above calculated data
Legends <-c(rep("Revenue Generated", 5), rep("Products Sold", 5), rep("Unique Products Sold", 5))
values <-c(Revenue_Generated, Products_Purchased, UProducts_Purchased)
mydata <-data.frame(Shop, values)
p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", aes(fill = Legends), position = "dodge") +
xlab("Shop") + ylab("Total") +
ggtitle("Relation between Revenue and Products Sold") +
theme_bw() + scale_y_continuous(labels = scales::comma)
Analysis: As visualized, the shops ordered by revenue generated are Shop 1, 2, 3, 5, 4, and the ordering by number of products sold is the same. By number of unique products sold, Shops 1, 2, 3, 4 and 5 are roughly on the same level. Based on this ordering, the shop generating the highest revenue also sells the most products (unique included), so in this dataset revenue generated and products sold are directly proportional to each other. The shop ordering is largely stable at Shop 1, 2, 3, 5, 4.
2) Visualize the approximate customer base count for the different shops
# Approximate Customer Base for Shops 1-5
C1<-slice2$products_purchased_shop_1
custS1<-length(which(C1 !=0))
C2<-slice2$products_purchased_shop_2
custS2<-length(which(C2 !=0))
C3<-slice2$products_purchased_shop_3
custS3<-length(which(C3 !=0))
C4<-slice2$products_purchased_shop_4
custS4<-length(which(C4 !=0))
C5<-slice2$products_purchased_shop_5
custS5<-length(which(C5 !=0))
# create data frame for Approximate Customer Base
Customers<- c(custS1,custS2,custS3,custS4,custS5)
TotalCustomers<- data.frame(Shop, Customers)
rownames(TotalCustomers) <- NULL
# Plot a Bar graph to depict the approximate Customer Base
values <-c(Customers)
mydata <-data.frame(Shop, values)
p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", fill = "gray" , position = "dodge", color = "black") +
xlab("Shop") + ylab("Total") +
ggtitle("Customer Base") +
theme_bw()
Analysis: As visualized, the shops ordered by approximate customer base are Shop 1, 2, 3, 5, 4. A shop's approximate customer base thus indicates its popularity in terms of customers purchasing products from it.
3) Visualize the relationship between average prices and distances to the shop
library(modelr)
library(gridExtra)
cleared_supermarket_data<-read_csv(file_path)
cleared_supermarket_tbl <- as_tibble(cleared_supermarket_data)
shop_ordered_slice <- select(cleared_supermarket_tbl, 3,23,4,24,5,25,6,26,7,27) %>%
bind_cols(cleared_supermarket_tbl[,8:10], cleared_supermarket_tbl[,28])
# Splitting the data
get_slice_for_shop <- function(col1, col2){
# Select the two named columns (not the range col1:col2, which would break
# for non-adjacent pairs such as min distance with average price)
shop_slice <- shop_ordered_slice[, c(col1, col2)]
colnames(shop_slice) <- c("distance","price")
return(shop_slice)
}
shop_1_data <- get_slice_for_shop(1,2)
shop_2_data <- get_slice_for_shop(3,4)
shop_3_data <- get_slice_for_shop(5,6)
shop_4_data <- get_slice_for_shop(7,8)
shop_5_data <- get_slice_for_shop(9,10)
shop_avg_data <- get_slice_for_shop(13,14)
shop_agg_min_data <- get_slice_for_shop(11,14)
shop_agg_max_data <- get_slice_for_shop(12,14)
# Combine data to the one mutated table to show all shops at the one graph
joined_shops_data <- mutate(shop_1_data, Shop="1") %>%
union_all(mutate(shop_2_data, Shop="2")) %>%
union_all(mutate(shop_3_data, Shop="3")) %>%
union_all(mutate(shop_4_data, Shop="4")) %>%
union_all(mutate(shop_5_data, Shop="5"))
# Create base for plots
get_base_for_plot <- function(dataset, caption){
plot_base <- ggplot(data = dataset, mapping = aes(x = distance, y = price)) + ggtitle(caption)
return(plot_base)
}
# Colours list
colours_shema <- c("Red", "Green", "Yellow", "Pink", "Blue", "Purple", "steelblue1", "tomato1")
#create scatter plot
add_geom_point <- function(colorNum){
geom_p <- geom_point(colour=colours_shema[colorNum], alpha=0.3)
return(geom_p)
}
#draw scatter plot with plot base
draw_cov_point_plot <- function(dataset, colorNum, caption){
cov_geom_plot <- get_base_for_plot(dataset, caption) + add_geom_point(colorNum) +
scale_y_continuous(trans="log2")+
geom_smooth(stat = 'smooth', color = 'Black', method = 'gam', formula = y ~ s(x, bs = "cs"))
return(cov_geom_plot)
}
p1_1 <- draw_cov_point_plot(shop_1_data, 1, "Shop 1") + theme_bw()
p2_1 <- draw_cov_point_plot(shop_2_data, 2, "Shop 2") + theme_bw()
p3_1 <- draw_cov_point_plot(shop_3_data, 3, "Shop 3") + theme_bw()
p4_1 <- draw_cov_point_plot(shop_4_data, 4, "Shop 4") + theme_bw()
p5_1 <- draw_cov_point_plot(shop_5_data, 5, "Shop 5") + theme_bw()
pavg_1 <- draw_cov_point_plot(shop_avg_data, 6, "Average price with average distance") + theme_bw()
pmin_1 <- draw_cov_point_plot(shop_agg_min_data, 7, "Average price with min distance") + theme_bw()
pmax_1 <- draw_cov_point_plot(shop_agg_max_data, 8, "Average price with max distance") + theme_bw()
pall_1 <- get_base_for_plot(joined_shops_data, "All shops") + geom_point(mapping = aes(colour = Shop), alpha=0.3) + theme_bw()
comb_cov_shops <- grid.arrange(p1_1, p2_1, p3_1, p4_1, p5_1,
nrow=2, ncol=3,
top="Covariation between distances and average prices")
comb_cov_aggrs <- grid.arrange(pmin_1, pmax_1,
nrow=2,
top= "Covariation between min/max distances and average prices")
comb_cov_avg <- grid.arrange(pall_1, pavg_1,
nrow=2,
top= "Covariation between distances and average prices (aggregated)")
Analysis: There is a strong dependency between the average distance to a shop and its average price. Also, many average prices are close to zero, so it makes sense to check for zero price values in the dataset and assess their influence.
4) Visualize data gap for the average price in each shop
prices <- shop_ordered_slice[,seq(2, 10 ,2)]
names(prices) <- c("Shop 1", "Shop 2", "Shop 3", "Shop 4", "Shop 5")
dataset_with_na <- data.frame(sapply(prices, function(x) {
na_if(x,0)
} ))
plot_missing(dataset_with_na)
Analysis: As visualized, there is a data gap for the average price in each shop. The average price of a shop is only filled in for a customer if that customer uses the shop; otherwise it is left as zero. This can be considered a data gap, but at the same time it is information that is never utilized during the analysis, since the customer does not choose that particular shop in the first place. This data gap does not affect the tendency of the relationship between price and distance.
5) Visualize patterns for the average price
joined_shops_without_null <- filter(joined_shops_data, price != 0)
mod <- lm(log(price) ~ distance, data = joined_shops_without_null)
joined_shops_data2 <- joined_shops_without_null %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
pall_4 <- ggplot(data = joined_shops_data2, aes(x = Shop, y = resid)) +
geom_bar(stat = "identity", fill = colours_shema[6]) + ggtitle("Average price pattern") +
theme_bw()
pall_4
Analysis: As visualized, the residuals give a view of the average price after removing the distance effect. Once the strong relationship between distance and price has been removed, relationships to other external factors become noticeable.
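The idea of removing the distance effect via the residuals of a log-linear model can be illustrated on synthetic data (the values below are simulated, not taken from the Coop dataset):

```r
set.seed(42)

# Synthetic data: price depends multiplicatively on distance, plus noise
distance <- runif(200, 10, 9000)
price <- exp(0.0002 * distance + rnorm(200, sd = 0.1)) * 2

# Fit on the log scale, as in the plot above, then back-transform the residuals
mod <- lm(log(price) ~ distance)
resid_ratio <- exp(residuals(mod))

# After removing the distance trend, the residual ratios centre around 1;
# systematic departures from 1 would point to other external factors
mean(resid_ratio)
```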
6) Visualize the revenue generated by each shop
## Call the packages
library(tidyverse)
library(ggplot2)
library("RColorBrewer")
library(plotly)
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
supermarket_data_clean <- read_csv(file_path)
### Loyalty Score based on Revenue ###
## Calculate the revenue generated for each shop
revenue_shop_1 <- sum(supermarket_data_clean$amount_purchased_shop_1)
revenue_shop_2 <- sum(supermarket_data_clean$amount_purchased_shop_2)
revenue_shop_3 <- sum(supermarket_data_clean$amount_purchased_shop_3)
revenue_shop_4 <- sum(supermarket_data_clean$amount_purchased_shop_4)
revenue_shop_5 <- sum(supermarket_data_clean$amount_purchased_shop_5)
## Create a vector
revenue <- c(revenue_shop_1, revenue_shop_2, revenue_shop_3, revenue_shop_4, revenue_shop_5)
shops <- c("shop 1", "shop 2", "shop 3", "shop 4", "shop 5")
## Create a data frame to store the vectors
revenue_per_shop <- data.frame(shops, revenue)
## Generate a plot
rps_plot <- ggplot(revenue_per_shop, aes(shops, revenue))
## Add features to the plot
rps_plot + geom_bar(stat = "identity", width = 0.6, position = "dodge2") +
xlab("Shops") + ylab("Revenue") +
ggtitle("Revenue Generated") +
theme_bw() + scale_y_continuous(labels = scales::comma)
Analysis: As visualized, the revenue generated by each shop can be calculated and compared; the same per-customer amounts can further be used to generate a list of the top N customers based on their contribution.
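The list of top N customers mentioned in the analysis is not computed in this step; a minimal base-R sketch of how it could be derived from the per-shop amount columns (the toy data frame and its column naming are assumptions mirroring the cleaned dataset):

```r
## Toy stand-in for supermarket_data_clean; the real data has five amount columns
customers <- data.frame(
  customer_id = 1:6,
  amount_purchased_shop_1 = c(10, 50, 5, 80, 20, 0),
  amount_purchased_shop_2 = c(30, 10, 60, 5, 40, 90)
)

## Total contribution of each customer across all shops
amount_cols <- grep("^amount_purchased_shop_", names(customers))
customers$total_spend <- rowSums(customers[, amount_cols])

## Top-N customers by overall revenue contribution
top_n_customers <- function(df, n) {
  df[order(df$total_spend, decreasing = TRUE), ][seq_len(n), ]
}
top3 <- top_n_customers(customers, 3)
top3$customer_id  # customers 6, 4 and 3 contribute the most in the toy data
```

Applying the same ordering to a single shop's amount column instead of the row sums would yield a per-shop list, as needed later for the Top 100 analysis.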
7) Visualize the most preferred shop by the customer with respect to average unique products purchased and average products purchased
library(tidyverse)
library(ggplot2)
library("RColorBrewer")
library(readr)
library(dplyr)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_Data_Prediction.csv"
supermarket_data_predict <- read_csv(file_path)
ggplot(data = supermarket_data_predict,
aes(x = products_purchased_avg, y = unique_products_purchased_avg)) +
geom_point(aes(colour = factor(most_pref_shop))) +
labs(colour = "Most_Pref_Shop") +
theme_bw() +
ylab("Avg unique products purchased") +
xlab("Products purchased average")
Analysis: As visualized, the most preferred shops form distinguishable groups in this feature space, so it can be inferred that, given the data, it is possible to predict which shop a customer would select.
Exploratory Data Analysis provided a feasibility check on the initially formulated RQs. The EDA phase helped to gain a better understanding of the data in relation to the project objective, leading to the modification and removal of existing RQs and the addition of new ones. Below is the final set of RQs that will be answered through this project,
1. Are customers willing to travel long distances to purchase products?
Relevance: This will help to understand the majority of the customer trends towards long distance travel to purchase products.
2. What are the factors that contribute towards the long distance travel of the customer to purchase products?
Relevance: This will help to understand the important factors that contribute towards the majority of the customers willing to travel long distances to purchase products, in turn better understanding the purchase behaviour of the customers.
3. What is the maximum likelihood of a customer to select a particular shop?
Relevance: This will help to understand which shops in the retail chain are most likely to be preferred by new customers. This can further facilitate better stock management to meet increasing customer demands.
4. What are the different customer segments based on their purchase behaviour?
Relevance: This will help to understand the groups of similar customers based on their purchase behaviour and to target different shop-level marketing schemes to the appropriate customers.
5. Which are the Top 100 customers that are most profitable in terms of revenue generation for each shop?
Relevance: This will help to understand the top profitable customers for the business and help to target appropriate loyalty programs to generate satisfied loyal customers as advocates for the business.
1. Are customers willing to travel long distances to purchase products?
Algorithms selected: Support Vector Machine (SVM), K-nearest neighbor (k-NN), Random Forest
Reason for Algorithm Selections: Given that the relationship between the variables may be non-linear and that this is a two-class classification problem, SVM is an accurate choice (Hsu et al. 2003).
The k-NN algorithm generally runs slower and has lower accuracy than SVM, but it exhibits certain practical qualities. As a lazy learner, k-NN requires no explicit training phase; consequently, it is easy to use and its results are easy to interpret (Soucy and Mineau 2001).
In comparison with k-NN, Random Forest is a good algorithm to train early in the model development process to see how it performs. Regarding the difference between Random Forest and SVM, Random Forest can use the data as it is, whereas SVM maximizes the “margin” and thus relies on a notion of “distance” between points, which makes feature scaling important. Random Forest is also very hard to beat in terms of predictive performance. Moreover, in contrast to SVM and k-NN, Random Forest does not demand parameter tuning to reach a high accuracy (Liaw, Wiener, and others 2002).
Features Selected: distance_shop_1, distance_shop_2, distance_shop_3, distance_shop_4, distance_shop_5, products_purchased_shop_1, products_purchased_shop_2, products_purchased_shop_3, products_purchased_shop_4, products_purchased_shop_5, unique_products_purchased_shop_1, unique_products_purchased_shop_2, unique_products_purchased_shop_3, unique_products_purchased_shop_4, unique_products_purchased_shop_5, avg_product_price_shop_1, avg_product_price_shop_2, avg_product_price_shop_3, avg_product_price_shop_4, avg_product_price_shop_5, amount_purchased_shop_1, amount_purchased_shop_2, amount_purchased_shop_3, amount_purchased_shop_4, amount_purchased_shop_5, class are of importance and are selected for this RQ.
Analysis
1) Data preparation for classification
library(caret)
library(randomForest)
library(e1071)
library(ggplot2)
library(dplyr)
library(tidyverse)
# Create helper functions
create_conf_matrix <- function(refLabels, predictLabels, positiveLabel){
conf_matrix <- confusionMatrix(
refLabels, # reference labels
predictLabels, # predicted labels
positive = positiveLabel, # label that corresponds to a "positive" results (optional)
dnn = c("actual", "predicted") # names of the confusion matrix dimensions (optional)
)
return (conf_matrix)
}
get_evaluation <- function(refLabels, predictLabels, positiveLabel){
conf_matrix <- create_conf_matrix(refLabels, predictLabels, positiveLabel)
conf_matrix
print(conf_matrix$overall["Accuracy"])
print(conf_matrix$byClass["Sensitivity"])
print(conf_matrix$byClass["Specificity"])
}
# Data preparation
cleared_supermarket_data <- read_csv("Input Dataset/Cleaned Dataset/Supermarket_Data_Classification.csv")
cleared_supermarked_tbl <- tbl_df(cleared_supermarket_data)
cleared_supermarked_tbl$class <- as.factor(cleared_supermarked_tbl$class)
2) k-fold cross validation
flds <- createFolds(factor(cleared_supermarked_tbl$class), k = 5, list = FALSE, returnTrain = TRUE)
comb_factor <- tbl_df(cbind(cleared_supermarked_tbl, flds))
train_folders <- c(1,3,4)
test_folders <- c(2,5)
train_data <- cleared_supermarked_tbl[comb_factor$flds %in% train_folders,]
test_data <- cleared_supermarked_tbl[comb_factor$flds %in% test_folders,]
# splitting data to test and training
train_ds <- train_data[, -26]
y_train <- train_data %>%
pull(class)
y_test <- test_data %>%
pull(class)
test_ds <- test_data[, -26]
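The split above fixes folds {1, 3, 4} for training and {2, 5} for testing, i.e. a single train/test combination. In a full k-fold rotation, each fold is held out once and the scores are averaged; a minimal base-R sketch of that rotation on toy data (the majority-class predictor is a placeholder, not one of the project's classifiers):

```r
set.seed(42)
## Toy data: 100 observations with a binary class label
toy <- data.frame(x = rnorm(100),
                  class = factor(rep(c("Yes", "No"), each = 50)))

k <- 5
## Random fold assignment 1..k with equal fold sizes
folds <- sample(rep(seq_len(k), length.out = nrow(toy)))

accuracies <- numeric(k)
for (i in seq_len(k)) {
  test_idx  <- which(folds == i)
  train_set <- toy[-test_idx, ]
  test_set  <- toy[test_idx, ]
  ## Placeholder model: always predict the training-set majority class
  majority <- names(which.max(table(train_set$class)))
  accuracies[i] <- mean(test_set$class == majority)
}
## Cross-validated accuracy: mean over the k held-out folds
mean(accuracies)
```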
3) Functions for classifying and plotting
classify_with_fit <- function(fit, title){
train_predicted <- predict(fit, train_ds, type = "class")
print("Evaluation for the training")
get_evaluation(y_train, train_predicted, "Yes")
predicted <- predict(fit, test_ds, type = "class")
print("Evaluation for the tests")
get_evaluation(y_test, predicted, "Yes")
draw_plot_for_classes(test_ds, predicted, title)
}
draw_plot_for_classes <- function(data, predicted, title){
plot_data <- cbind(data, predicted)
ggplot(plot_data, aes(x = predicted, fill = predicted)) +
geom_bar() +
xlab("Prediction") + ylab("Customer count") +
theme_bw() +
theme(legend.title = element_blank()) +
ggtitle(title)
}
Classification
4) Classification with SVM
#tune SVM
tuneSvm <- tune(svm, class ~ ., data = train_data, ranges = list(gamma = 2^(-1:1)),
cost = 2^(2:4), tunecontrol = tune.control(sampling = "fix"))
summary(tuneSvm)
##
## Parameter tuning of 'svm':
##
## - sampling method: fixed training/validation set
##
## - best parameters:
## gamma
## 0.5
##
## - best performance: 0.0525967
##
## - Detailed performance results:
## gamma error dispersion
## 1 0.5 0.05259670 NA
## 2 1.0 0.06626356 NA
## 3 2.0 0.09392860 NA
plot(tuneSvm)
#classify with best params
svmFit <- svm(class ~ ., data = train_data, kernel = "radial",
cost = 1, gamma = 0.5,
scale=TRUE, cachesize=95)
#plot(svmFit, train_data)
classify_with_fit(svmFit, "SVM classification")
## [1] "Evaluation for the training"
## Accuracy
## 0.9762279
## Sensitivity
## 0.9777037
## Specificity
## 0.9725791
## [1] "Evaluation for the tests"
## Accuracy
## 0.9479417
## Sensitivity
## 0.9508535
## Specificity
## 0.9403409
5) Classification with k-NN
# train knn classifier
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
knnTrain <- train(class ~ ., data = train_data, method = "knn",
trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
plot(knnTrain)
#classify with best params
knnFit <- knn3(train_ds, y_train, k = 7)
classify_with_fit(knnFit, "KNN classification")
## [1] "Evaluation for the training"
## Accuracy
## 0.9541125
## Sensitivity
## 0.960798
## Specificity
## 0.9374094
## [1] "Evaluation for the tests"
## Accuracy
## 0.9379193
## Sensitivity
## 0.9475629
## Specificity
## 0.91347
6) Classification with Random Forest
randomFit <- randomForest(class ~ ., train_data, ntree=500)
classify_with_fit(randomFit, "Random Forest")
## [1] "Evaluation for the training"
## Accuracy
## 1
## Sensitivity
## 1
## Specificity
## 1
## [1] "Evaluation for the tests"
## Accuracy
## 0.9761451
## Sensitivity
## 0.978052
## Specificity
## 0.971445
Observations: It was found that the majority of the customers are indeed ready to travel long distances to purchase products, and that this readiness is affected by certain factors. Binary classification divides the data into two separate classes, which helps to determine whether a buyer will travel a long distance to the store or not, based on certain factors. Thus, the research question is answered by classifying the customers based on selected features to identify whether or not the majority of them are ready to travel long distances to purchase products.
Applications: These insights can be further utilized by the business to understand the behaviour trend of the majority of the customers related to long-distance travel to purchase products. This further paves the way for the business to understand the reasons behind such majority trends. Eventually, this helps the business to devise strategies in the context of store locations, coupled with enhancing the factors influencing such trends, thus generating more revenue for the business with increased customer satisfaction.
2. What are the factors that contribute towards the long distance travel of the customer to purchase products?
Algorithms selected: a custom algorithm, as a follow-up investigation to RQ1
Features Selected: Same as those selected for RQ1
Analysis
1) Determine the responsible factors
Out of all the 5 shops in the chain, every customer has one shop which they prefer the most. To find out the factors responsible for a customer selecting a particular shop as the most preferred one, 7 new columns were generated: the shop that is closest to the customer (a), the shop from which they buy the largest number of products (b), the shop from which they buy the largest number of unique products (c), the shop at which they have the lowest average price (d), the shop at which they spend the most money (e), the most preferred shop (f) and the factor based on which they choose their most preferred shop (g). To generate the values for column (f), the most frequent value among columns (b), (c), (d) and (e) is selected and assigned to the customer as the most preferred shop. Based on the most preferred shop, each customer is then assigned a factor: customers whose most preferred shop is the closest shop, the shop with the lowest average price or the shop with the most unique products purchased are assigned ‘dist’, ‘price’ or ‘satisf’ respectively. Customers matching a combination of these three criteria are assigned ‘dist_price’, ‘dist_satisf’, ‘price_satisf’ or ‘dist_price_satisf’ accordingly.
Every product has a sophistication value attached to it and is meant to satisfy the needs of a customer. The higher the sophistication value of a product, the higher the satisfaction it provides to the buyer. Products needed for daily use such as bread, water, etc. are bought in bulk and are considered less sophisticated than those bought in comparatively smaller quantities such as a DVD player, a coffee flask, etc. So, in this RQ it is assumed that the higher the number of unique products bought, the higher the satisfaction of the customer and the more cost they will be ready to pay, either in terms of distance to travel or in terms of the price of the items.
library(tidyverse)
library(dplyr)
## Set file path
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
## Read data from a file
supermarket_data_clean <- read_csv(file_path)
## List of all features present in the data frame
all_features <- colnames(supermarket_data_clean)
## List of selected features
sel_features <- all_features[-c(1,2,8,9,10,16,22,28,34,35,36,37,38,39,40)]
## Create a data frame with only the selected features
supermarket_data_model <- supermarket_data_clean %>% select(sel_features)
## Generate new columns to find the closest shop, the shop from which most products are purchased, the shop from which most unique products are purchased, the shop offering the least average product price and the shop at which the maximum amount of money is spent by a customer.
supermarket_data_model$min_dist <- str_sub(colnames(supermarket_data_model[,1:5]),-1,-1)[apply(supermarket_data_model[,1:5],1,which.min)]
supermarket_data_model$most_prod_purch_from <- str_sub(names(supermarket_data_model[,6:10]),-1,-1)[max.col(supermarket_data_model[,6:10], "last")]
supermarket_data_model$most_uni_prod_purch_from <- str_sub(names(supermarket_data_model[,11:15]),-1,-1)[max.col(supermarket_data_model[,11:15], "last")]
supermarket_data_model$least_avg_prod_pri <- str_sub(colnames(supermarket_data_model[,16:20]),-1,-1)[apply(supermarket_data_model[,16:20],1,which.min)]
supermarket_data_model$max_amt_purch <- str_sub(names(supermarket_data_model[,21:25]),-1,-1)[max.col(supermarket_data_model[,21:25], "last")]
## Create a data frame having only the newly generated columns
test <- supermarket_data_model[,26:30]
## Generate new columns - the most preferred shop - and categorise the customers based on the factors
for (row in 1:nrow(test)){
## Create a vector for each row
vec <- c(test[row, "most_prod_purch_from"], test[row, "most_uni_prod_purch_from"], test[row, "least_avg_prod_pri"], test[row, "max_amt_purch"])
## Sort and find the most preferred shop
supermarket_data_model[row, "most_pref_shop"] <- names(sort(summary(as.factor(unlist(vec))), decreasing=T)[1:1])
## Assign labels to customers (0 or 'dist' - 'distance', 1 or 'price' - 'price', 2 or 'satisf' - 'satisfaction', 3 or 'dist_price' - 'distance and price', 4 or 'dist_satisf' - 'distance and satisfaction', 5 or 'price_satisf' - 'price and satisfaction', 6 or 'dist_price_satisf' - 'distance, price and satisfaction' and 7 - none of the above, i.e. 'others')
if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 6
}
else if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 3
}
else if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 4
}
else if (supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 5
}
else if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 0
}
else if (supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 1
}
else if (supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 2
}
else{
supermarket_data_model[row, "factor"] <- 7
}
}
## Re-order columns
supermarket_data_clus <- supermarket_data_model[c(1:25, 32)]
2) Visualize the responsible factors
## Create the legend for the plot
Legend <- c("Distance", "Price", "Distance and Price", "Satisfaction", "Distance and Satisfaction", "Price and Satisfaction", "Distance, Price and Satisfaction", "Others")
## Count the customers assigned to each factor
dist <- length(which(supermarket_data_model$factor == 0))
price <- length(which(supermarket_data_model$factor == 1))
dist_price <- length(which(supermarket_data_model$factor == 3))
satisf <- length(which(supermarket_data_model$factor == 2))
dist_satisf <- length(which(supermarket_data_model$factor == 4))
price_satisf <- length(which(supermarket_data_model$factor == 5))
dist_price_satisf <- length(which(supermarket_data_model$factor == 6))
others <- length(which(supermarket_data_model$factor == 7))
## Create a vector of values to be shown in the plot
values <- c(dist, price, dist_price, satisf, dist_satisf, price_satisf, dist_price_satisf, others)
## Create a vector of class labels; data.frame() recycles it so that each
## distance-based factor is paired with "no" and each non-distance factor with "yes"
labels <- c("no", "yes")
## Create a data frame to store the vectors
factor_count <- data.frame(labels, values)
## Generate a plot
fc_plot <- ggplot(factor_count, aes(labels, values))
## Add featurs to the plot
fc_plot + geom_bar(stat = "identity", aes(fill = Legend)) +
xlab("Class") + ylab("Total") +
ggtitle("Customer Classification") +
theme_bw() + scale_y_continuous(labels = scales::comma)
Observations: It was observed that out of the 42499 customers who are willing to travel, 42001 choose their most preferred shop based on their ‘Satisfaction’ or ‘Shopping experience’, whereas the others decide based on ‘Price’ (4), ‘Price and Satisfaction’ (19) and Others (475). Out of the 17866 customers who do not like to travel long distances and select the shop closest to them as the most preferred shop, 17221 opted based on the factor ‘Distance and Satisfaction’, whereas the others decided based on ‘Distance’ (620), ‘Distance and Price’ (0) and ‘Distance, Price and Satisfaction’ (25). Thus, the research question is answered by determining the factors responsible for the majority customer trend towards long-distance travel. Additionally, it was found that ‘Satisfaction’ is a key factor affecting a customer’s decision-making process.
Applications: These insights obtained can be further utilized by the business to devise strategies to enhance the observed most important factors generating customer satisfaction & retention and eventually a steady growth for the business.
3. What is the maximum likelihood of a customer to select a particular shop?
Algorithms selected: Naive Bayes, Decision tree
Reason for Algorithm Selections: Naive Bayes is a supervised classifier which works on the assumption that all attributes are independent of each other. Because of this, all attributes can be learned separately which results in faster performance. But its accuracy rate is less than that of Decision tree (John and Langley 1995).
Decision tree is a type of supervised learning algorithm that can be used in both regression and classification problems. A small change in the data can cause a large change in the final estimated tree. However, decision trees are intuitively very easy to explain and closely mirror human decision-making compared to other regression and classification approaches (Safavian and Landgrebe 1991).
Features Selected: distance_avg, products_purchased_avg, unique_products_purchased_avg, product_price_avg, amount_purchased_avg.
Analysis
1) Data preparation for classification
library(caret)
library(dplyr) # Used by caret
library(e1071)
library(rpart)
library(readr)
library(tidyverse)
library(rpart.plot)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_Data_Prediction.csv"
supermarket_data_predict <- read_csv(file_path)
supermarket_data_predict$most_pref_shop=factor(supermarket_data_predict$most_pref_shop)
2) Stratified k-fold cross validation
#stratified k-fold(5)
set.seed(123)
folds <- cut(seq(1,nrow(supermarket_data_predict)),breaks=5,labels=FALSE)
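A caveat: cut(seq(1, nrow(...)), breaks = 5) assigns contiguous row blocks to folds, so the folds are stratified by class only if the rows are already shuffled. caret::createFolds(y, k) returns class-stratified folds directly; the same idea can be sketched in base R (the toy class labels below are assumptions):

```r
set.seed(123)
## Toy class labels with a 70/30 imbalance
y <- factor(rep(c("shop_1", "shop_2"), times = c(70, 30)))

## Assign folds 1..k separately within each class so that every
## fold preserves the overall class proportions
stratified_folds <- function(y, k) {
  folds <- integer(length(y))
  for (cl in levels(y)) {
    idx <- which(y == cl)
    folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  folds
}
folds <- stratified_folds(y, 5)
## Each fold now contains exactly 14 shop_1 and 6 shop_2 labels
table(folds, y)
```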
3) Accuracy calculation
#to store accuracy,sensitivity ,specificity for naive bayes and decision tree
resnb<-matrix(ncol=3, nrow=5)
resdec<-matrix(ncol=3, nrow=5)
#Loop over the 5 folds, i.e. 5 different train-test dataset combinations
for(i in 1:5){
#Segment the data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
test <- supermarket_data_predict[testIndexes, ]
train <- supermarket_data_predict[-testIndexes, ]
ytrain<-train%>%pull(most_pref_shop)
ytest<-test%>%pull(most_pref_shop)
#function to calculate accuracy of different models
printALL=function(model,name,result){
print(name)
testPred=predict(model, newdata=test, type="class")
conftest<-confusionMatrix(ytest,testPred,"YES")
print(conftest$overall["Accuracy"])
#print(conftest)
result[i,1]= conftest$overall["Accuracy"]
print("Sensitivity")
print(max(conftest$byClass[,"Sensitivity"]))
result[i,2]=max(conftest$byClass[,"Sensitivity"])
print("Specificity")
print(max(conftest$byClass[,"Specificity"]))
result[i,3]=max(conftest$byClass[,"Specificity"])
return(result)
}
NBclassfier=naiveBayes(most_pref_shop ~., data=train,laplace=3)
modelr<-rpart(most_pref_shop ~., data=train, method="class",control=rpart.control(cp=0.0001))
resnb=printALL(NBclassfier,"naive bayes",resnb)
resdec=printALL(modelr,"decision trees",resdec)
}
## [1] "naive bayes"
## Accuracy
## 0.3704961
## [1] "Sensitivity"
## [1] 0.6308943
## [1] "Specificity"
## [1] 0.9635243
## [1] "decision trees"
## Accuracy
## 0.5592645
## [1] "Sensitivity"
## [1] 0.606158
## [1] "Specificity"
## [1] 0.9558295
## [1] "naive bayes"
## Accuracy
## 0.4452911
## [1] "Sensitivity"
## [1] 0.6657572
## [1] "Specificity"
## [1] 0.9715398
## [1] "decision trees"
## Accuracy
## 0.6039924
## [1] "Sensitivity"
## [1] 0.6626742
## [1] "Specificity"
## [1] 0.971344
## [1] "naive bayes"
## Accuracy
## 0.4221817
## [1] "Sensitivity"
## [1] 0.7053619
## [1] "Specificity"
## [1] 0.9769286
## [1] "decision trees"
## Accuracy
## 0.6097904
## [1] "Sensitivity"
## [1] 0.6833504
## [1] "Specificity"
## [1] 0.9695692
## [1] "naive bayes"
## Accuracy
## 0.4387476
## [1] "Sensitivity"
## [1] 1
## [1] "Specificity"
## [1] 0.9842553
## [1] "decision trees"
## Accuracy
## 0.68384
## [1] "Sensitivity"
## [1] 0.7697702
## [1] "Specificity"
## [1] 0.9806257
## [1] "naive bayes"
## Accuracy
## 0.4257434
## [1] "Sensitivity"
## [1] 0.7682863
## [1] "Specificity"
## [1] 0.9839271
## [1] "decision trees"
## Accuracy
## 0.6587426
## [1] "Sensitivity"
## [1] 0.7479121
## [1] "Specificity"
## [1] 0.9842744
4) Plotting model’s performance
#cross-validation accuracy plot
par(mfrow=c(2,3))
y<-list(resnb[1,1],resnb[2,1],resnb[3,1],resnb[4,1],resnb[5,1])
ynew<-list(resdec[1,1],resdec[2,1],resdec[3,1],resdec[4,1],resdec[5,1])
x<-list(1,2,3,4,5)
xaxis<-unlist(x)
accuracyplot<-plot(xaxis,unlist(y),type="l" ,lwd=2,col="red",xlab="Folds",ylab="Accuracy",ylim=range( c(y, ynew) ),main="Cross Validation - Accuracy")
lines(xaxis,unlist(ynew),type="l",lwd=2,col="green")
legend("topleft",
legend = c("Naive Bayes", "Decision Tree"),lwd=2,col=c("red","green"))
#cross-validation sensitivity plot
y<-list(resnb[1,2],resnb[2,2],resnb[3,2],resnb[4,2],resnb[5,2])
ynew<-list(resdec[1,2],resdec[2,2],resdec[3,2],resdec[4,2],resdec[5,2])
x<-list(1,2,3,4,5)
xaxis<-unlist(x)
sensitivityplot<-plot(xaxis,unlist(y),type="l" ,lwd=2,col="red",xlab="Folds",ylab="Sensitivity",ylim=range( c(y, ynew) ),main="Cross Validation - Sensitivity")
lines(xaxis,unlist(ynew),type="l",lwd=2,col="green")
legend("topleft",
legend = c("Naive Bayes", "Decision Tree"),lwd=2,col=c("red","green"))
#cross-validation specificity plot
y<-list(resnb[1,3],resnb[2,3],resnb[3,3],resnb[4,3],resnb[5,3])
ynew<-list(resdec[1,3],resdec[2,3],resdec[3,3],resdec[4,3],resdec[5,3])
x<-list(1,2,3,4,5)
xaxis<-unlist(x)
specificityplot<-plot(xaxis,unlist(y),type="l" ,lwd=2,col="red",xlab="Folds",ylab="Specificity",ylim=range( c(y, ynew) ),main="Cross Validation - Specificity")
lines(xaxis,unlist(ynew),type="l",lwd=2,col="green")
legend("topleft",
legend = c("Naive Bayes", "Decision Tree"),lwd=2,col=c("red","green"))
Classification
5) Naive Bayes
#hyperparameter tuning is rarely applied for naive Bayes because its only tuning parameter is the Laplace smoothing value (laplace)
NBclassfier=naiveBayes(most_pref_shop ~., data=train,laplace=3)
print(NBclassfier)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## shop_1 shop_2 shop_3 shop_4 shop_5
## 0.57260002 0.28950965 0.06526961 0.03690052 0.03572020
##
## Conditional probabilities:
## distance_avg
## Y [,1] [,2]
## shop_1 2228.0286 1081.4090
## shop_2 2000.6751 1083.2063
## shop_3 1142.3622 750.0522
## shop_4 1136.5372 939.5079
## shop_5 864.3805 748.5977
##
## products_purchased_avg
## Y [,1] [,2]
## shop_1 734.1665 983.3059
## shop_2 984.0807 1220.7187
## shop_3 865.1342 1024.8160
## shop_4 701.4675 1151.5718
## shop_5 735.0910 1005.2429
##
## unique_products_purchased_avg
## Y [,1] [,2]
## shop_1 153.4718 124.71775
## shop_2 157.2885 114.48737
## shop_3 121.4768 83.23297
## shop_4 101.9315 91.71829
## shop_5 106.8180 84.50388
##
## product_price_avg
## Y [,1] [,2]
## shop_1 4.184790 11.381637
## shop_2 3.207168 7.561573
## shop_3 2.703877 1.293082
## shop_4 2.717495 1.715258
## shop_5 2.620264 1.197237
##
## amount_purchased_avg
## Y [,1] [,2]
## shop_1 8.693031 12.115744
## shop_2 9.339857 9.635032
## shop_3 8.872955 5.950007
## shop_4 8.097596 6.619715
## shop_5 8.277849 6.116294
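The laplace = 3 argument used above implements Laplace (additive) smoothing; note that in e1071 it affects only categorical predictors, while the numeric features here are modelled with Gaussians, as the mean/standard-deviation tables above show. The smoothing arithmetic itself can be sketched in base R (the toy counts are assumptions):

```r
## Laplace-smoothed probability estimate for one categorical feature:
## (count + laplace) / (total + laplace * number_of_levels)
laplace_prob <- function(counts, laplace) {
  (counts + laplace) / (sum(counts) + laplace * length(counts))
}

## Toy counts: one feature level never observed for this class
counts <- c(level_a = 8, level_b = 2, level_c = 0)
p_raw    <- laplace_prob(counts, laplace = 0)  # level_c gets probability 0
p_smooth <- laplace_prob(counts, laplace = 3)  # every level stays non-zero
```

Without smoothing, a single unseen level would zero out the whole class posterior; the pseudo-count prevents that while keeping the probabilities normalized.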
6) Decision tree (rpart)
#hypertuning rpart
obj3 <- tune.rpart(most_pref_shop~., data =train, minsplit = c(5,10,15))
summary(obj3)
##
## Parameter tuning of 'rpart.wrapper':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## minsplit
## 5
##
## - best performance: 0.4053258
##
## - Detailed performance results:
## minsplit error dispersion
## 1 5 0.4053258 0.01076457
## 2 10 0.4053258 0.01076457
## 3 15 0.4053258 0.01076457
#decision tree classifier
modelr<-rpart(most_pref_shop ~., data=train, method="class",control=rpart.control(cp=0.001))
print(modelr$cptable)
## CP nsplit rel error xerror xstd
## 1 0.014437984 0 1.0000000 1.0000000 0.005267091
## 2 0.009593023 3 0.9465116 0.9462694 0.005225369
## 3 0.009011628 4 0.9369186 0.9378391 0.005217753
## 4 0.007267442 5 0.9279070 0.9352713 0.005215374
## 5 0.005087209 6 0.9206395 0.9268411 0.005207375
## 6 0.004360465 8 0.9104651 0.9201550 0.005200821
## 7 0.004215116 9 0.9061047 0.9185078 0.005199177
## 8 0.002664729 11 0.8976744 0.9140988 0.005194723
## 9 0.002567829 12 0.8950097 0.9078488 0.005188269
## 10 0.002470930 13 0.8924419 0.9072190 0.005187609
## 11 0.002083333 16 0.8850291 0.8999031 0.005179827
## 12 0.001534238 17 0.8829457 0.8921027 0.005171280
## 13 0.001514050 20 0.8783430 0.8900678 0.005169007
## 14 0.001501938 24 0.8722868 0.8890019 0.005167810
## 15 0.001405039 26 0.8692829 0.8868217 0.005165347
## 16 0.001146641 27 0.8678779 0.8857558 0.005164135
## 17 0.001065891 31 0.8630329 0.8820736 0.005159911
## 18 0.001000000 32 0.8619671 0.8808624 0.005158509
plotcp(modelr)
prp(modelr)
#the relative error is reduced but the graph is overplotted
modelr<-rpart(most_pref_shop ~., data=train, method="class",control=rpart.control(cp=0.0001))
print(modelr$cptable)
## CP nsplit rel error xerror xstd
## 1 0.0144379845 0 1.0000000 1.0000000 0.005267091
## 2 0.0095930233 3 0.9465116 0.9466570 0.005225712
## 3 0.0090116279 4 0.9369186 0.9451550 0.005224379
## 4 0.0072674419 5 0.9279070 0.9367248 0.005216724
## 5 0.0050872093 6 0.9206395 0.9287791 0.005209240
## 6 0.0043604651 8 0.9104651 0.9218023 0.005202453
## 7 0.0042151163 9 0.9061047 0.9205426 0.005201206
## 8 0.0026647287 11 0.8976744 0.9169574 0.005197620
## 9 0.0025678295 12 0.8950097 0.9090116 0.005189482
## 10 0.0024709302 13 0.8924419 0.9076066 0.005188015
## 11 0.0020833333 16 0.8850291 0.8993217 0.005179198
## 12 0.0015342377 17 0.8829457 0.8887597 0.005167538
## 13 0.0015140504 20 0.8783430 0.8852229 0.005163527
## 14 0.0015019380 24 0.8722868 0.8852229 0.005163527
## 15 0.0014050388 26 0.8692829 0.8838663 0.005161974
## 16 0.0011466408 27 0.8678779 0.8818798 0.005159687
## 17 0.0010658915 31 0.8630329 0.8780523 0.005155232
## 18 0.0009447674 32 0.8619671 0.8774709 0.005154550
## 19 0.0008963178 34 0.8600775 0.8765988 0.005153524
## 20 0.0007613511 36 0.8582849 0.8747093 0.005151290
## 21 0.0007428941 45 0.8510659 0.8741764 0.005150657
## 22 0.0007267442 48 0.8488372 0.8738857 0.005150311
## 23 0.0006419574 50 0.8473837 0.8692345 0.005144729
## 24 0.0006298450 56 0.8412791 0.8678295 0.005143025
## 25 0.0006056202 59 0.8393895 0.8681202 0.005143378
## 26 0.0005813953 61 0.8381783 0.8679264 0.005143143
## 27 0.0005571705 62 0.8375969 0.8675388 0.005142671
## 28 0.0005329457 64 0.8364826 0.8673934 0.005142494
## 29 0.0005087209 67 0.8348837 0.8671996 0.005142258
## 30 0.0004844961 71 0.8328488 0.8671027 0.005142140
## 31 0.0004360465 72 0.8323643 0.8676357 0.005142789
## 32 0.0003875969 79 0.8289244 0.8662791 0.005141134
## 33 0.0003714470 94 0.8231105 0.8656008 0.005140304
## 34 0.0003391473 99 0.8210271 0.8641957 0.005138578
## 35 0.0003149225 113 0.8162791 0.8652616 0.005139888
## 36 0.0003028101 119 0.8143895 0.8651647 0.005139769
## 37 0.0002906977 125 0.8123062 0.8649225 0.005139472
## 38 0.0002745478 135 0.8093992 0.8633236 0.005137502
## 39 0.0002664729 143 0.8070736 0.8634205 0.005137621
## 40 0.0002583979 147 0.8060078 0.8647287 0.005139233
## 41 0.0002543605 161 0.8022771 0.8650194 0.005139591
## 42 0.0002422481 165 0.8012597 0.8659884 0.005140779
## 43 0.0002260982 190 0.7948643 0.8661822 0.005141016
## 44 0.0002180233 196 0.7935078 0.8655523 0.005140245
## 45 0.0002131783 224 0.7867248 0.8655523 0.005140245
## 46 0.0002099483 230 0.7854167 0.8653585 0.005140007
## 47 0.0002034884 250 0.7797965 0.8653585 0.005140007
## 48 0.0001937984 256 0.7785368 0.8652616 0.005139888
## 49 0.0001877422 321 0.7650678 0.8656492 0.005140363
## 50 0.0001816860 331 0.7630814 0.8673934 0.005142494
## 51 0.0001776486 336 0.7621124 0.8677326 0.005142907
## 52 0.0001695736 346 0.7603198 0.8678779 0.005143084
## 53 0.0001614987 392 0.7518895 0.8685562 0.005143908
## 54 0.0001574612 441 0.7424903 0.8747093 0.005151290
## 55 0.0001534238 445 0.7418605 0.8747093 0.005151290
## 56 0.0001453488 452 0.7406977 0.8764535 0.005153352
## 57 0.0001372739 566 0.7234981 0.8788760 0.005156196
## 58 0.0001356589 572 0.7226744 0.8788760 0.005156196
## 59 0.0001332364 593 0.7186531 0.8861434 0.005164576
## 60 0.0001291990 601 0.7175872 0.8862888 0.005164741
## 61 0.0001259690 614 0.7158915 0.8871124 0.005165676
## 62 0.0001211240 623 0.7144380 0.8899225 0.005168845
## 63 0.0001162791 728 0.7007752 0.8919574 0.005171118
## 64 0.0001130491 734 0.7000000 0.8938469 0.005173213
## 65 0.0001107420 776 0.6945736 0.8939438 0.005173320
## 66 0.0001090116 788 0.6927326 0.8939438 0.005173320
## 67 0.0001049742 808 0.6903101 0.9052810 0.005185570
## 68 0.0001000000 824 0.6884205 0.9054264 0.005185723
Observations: The plots of the overall model performance with respect to accuracy, specificity and sensitivity of each fold suggest that the decision tree achieves higher accuracy than Naive Bayes on the input dataset. Thus, the research question is answered by visualizing and comparing the performance of the Naive Bayes and decision tree models, indicating that a decision-tree-based predictive model will produce better results when predicting the shop a new customer is most likely to select.
Applications: These insights can be utilized by the business to understand which shops in the retail chain new customers are most likely to prefer. Predicting the shop selection of new customers further facilitates better stock management to meet growing customer demand. Accordingly, shop-specific strategies to increase profit and attract new customers can be built on these predictions.
4. What are the different customer segments based on their purchase behaviour?
Algorithms selected: K-means, Principal Component Analysis (PCA)
Reason for Algorithm Selection: K-means clustering is a simple and fast algorithm (Hartigan and Wong 1979). It is a popular method for customer segmentation, especially on numerical data, and it scales well to large datasets. Hierarchical and model-based clustering methods require computing a full distance matrix, which limits their scalability and demands large amounts of memory on big datasets; K-means is considerably more run-time efficient. Given that the input dataset is large and mainly numerical, K-means was an ideal choice for customer segmentation.
PCA is a dimensionality reduction algorithm that captures the essence of a dataset by decomposing and transforming the data into principal components (PCs) that maximize the linear variance of the data (more variance retained means a better summary of the data) (Ding and He 2004). Since PCA approaches the structure of the data from a different perspective than K-means, it can reveal customer groupings that K-means alone would miss. PCA is used here as a valuable cross-check on the number of clusters chosen for K-means.
Features Selected: This RQ aims to identify clusters of customers based on their purchase behaviour, i.e. the shops the customers prefer for their purchases. Hence the features customer_id, amount_purchased_shop_1, amount_purchased_shop_2, amount_purchased_shop_3, amount_purchased_shop_4 and amount_purchased_shop_5 are of importance and are selected for this RQ.
Analysis
1) Estimating the optimal number of clusters
K-means requires the number of clusters to be specified before the algorithm starts. Determining the optimal number of clusters is crucial for obtaining good results.
library(cluster)
library(factoextra)
library("metricsgraphics")
# Read file contents
supermarket_data_clean <- read.csv("Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv")
# Prepare data frames for clustering
# Select only the columns customer_id and amount_purchased_shop_1 to _5
cluster.slice.temp <- supermarket_data_clean[, c(1, 29, 30, 31, 32, 33)]
# Remove customer_id from the clustering data frame
cluster.slice.data <- supermarket_data_clean[, c(29, 30, 31, 32, 33)]
# Scale the data and determine the ideal number of clusters (elbow method)
cluster.slice.scale <- scale(cluster.slice.data)
wssplot <- function(data, nc = 15, seed = 1234) {
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}
wssplot(cluster.slice.scale)
The above plot shows a sharp decrease in the within-group sum of squares as the number of clusters increases from 1 to 4, with only a slight decrease from 4 to 5, suggesting a 4-cluster or 5-cluster solution.
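The elbow estimate can be cross-checked with the average silhouette width, where the k maximizing the width is preferred. Below is a minimal sketch on synthetic data; on the real data, `cluster.slice.scale` would take the place of `toy`:

```r
library(cluster)  # for silhouette()

set.seed(1234)
# Two well-separated 2-D blobs, so k = 2 should score best
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))

avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(toy))[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil  # the k with the largest average width is the preferred cluster count
```

When both criteria agree, the chosen k is on firmer ground; when they disagree, the cluster plots below help to decide.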
2) Perform K-means clustering with the number of clusters as 4 and 5
# Perform k-means on cluster values as 4 and 5
# On entire dataset
set.seed(123) # fix the random starting clusters
kclust4 <- kmeans(cluster.slice.data, 4, nstart = 25)
set.seed(123) # fix the random starting clusters
kclust5 <- kmeans(cluster.slice.data, 5, nstart = 25)
3) Perform PCA to visualize the clusters
# PCA on the shop-amount columns (observations = customers)
pca <- prcomp(cluster.slice.data, scale. = TRUE, center = TRUE)
fviz_eig(pca) +
theme_bw() + scale_y_continuous(labels = scales::comma) +
ggtitle(label='Principal Component Analysis')
As observed, PC1 and PC2 together explain the majority of the variance in the data, and there is a clear drop from PC2 to PC3. This indicates that a visualization based on PC1 and PC2 gives a good understanding of the data, and that including further components beyond PC2 would yield only minimal improvement.
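The variance explained per component can also be read off numerically from the `prcomp` object rather than only from the scree plot. A sketch on synthetic correlated data standing in for the five shop-amount columns (the shared `base` term is an assumption used to induce correlation):

```r
set.seed(42)
n <- 200
base <- rnorm(n)  # shared latent spending level inducing correlation
shops <- data.frame(s1 = base + rnorm(n, sd = 0.3),
                    s2 = base + rnorm(n, sd = 0.3),
                    s3 = rnorm(n), s4 = rnorm(n), s5 = rnorm(n))
pca_toy <- prcomp(shops, center = TRUE, scale. = TRUE)
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
round(cumsum(var_explained), 3)  # cumulative share of variance per PC
```

The cumulative shares make the "drop after PC2" argument quantitative instead of purely visual.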
PCA with 4 cluster K-means
cluster.pc4 <- prcomp(cluster.slice.data, center = FALSE, scale. = FALSE)$x %>% as.data.frame()
cluster.pc4$kmeans.cluster <- factor(kclust4$cluster)
p <- ggplot(cluster.pc4, aes(x = PC1, y = PC2, color = kmeans.cluster))
p + geom_point() +
  theme_bw() + scale_y_continuous(labels = scales::comma) +
  ggtitle(label = 'PCA with 4 cluster K-means')
PCA with 5 cluster K-means
cluster.pc5 <- prcomp(cluster.slice.data, center = FALSE, scale. = FALSE)$x %>% as.data.frame()
cluster.pc5$kmeans.cluster <- factor(kclust5$cluster)
p <- ggplot(cluster.pc5, aes(x = PC1, y = PC2, color = kmeans.cluster))
p + geom_point() +
  theme_bw() + scale_y_continuous(labels = scales::comma) +
  ggtitle(label = 'PCA with 5 cluster K-means')
Comparing the above two plots shows that a 5-cluster solution is the better estimate for K-means clustering.
4) Visualize the different separable clusters in the data
fviz_cluster(kclust5, data = cluster.slice.data, geom = "point",
stand = FALSE, ellipse.type = "norm") +
theme_bw() + scale_y_continuous(labels = scales::comma) +
ggtitle(label='Customer Clusters')
5) Cluster Analysis
Determine the different customers belonging to each cluster
## Retrieve the customer IDs in each cluster
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 1, ])))
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 2, ])))
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 3, ])))
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 4, ])))
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 5, ])))
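Beyond listing customer IDs, the size and center of each cluster give a quick summary of the cluster bases. A hedged sketch on synthetic data, where `toy` stands in for the five shop-amount columns and `kc` for the `kclust5` object above:

```r
set.seed(123)
# 60 fake customers with 5 shop-amount columns
toy <- matrix(rnorm(300), ncol = 5,
              dimnames = list(NULL, paste0("shop_", 1:5)))
kc <- kmeans(toy, centers = 5, nstart = 25)
table(kc$cluster)     # number of customers per cluster
round(kc$centers, 2)  # mean amount per shop within each cluster
```

On the real data, `kclust5$size` and `kclust5$centers` deliver the same summary directly.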
6) Customer Segmentation
# Customer segmentation through aggregation of results by mean
cluster.slice.kmeans.aggregate <- aggregate(cluster.slice.data, by = list(kclust5$cluster), mean)
cluster <- cluster.slice.kmeans.aggregate$Group.1
shop1 <- cluster.slice.kmeans.aggregate$amount_purchased_shop_1
shop2 <- cluster.slice.kmeans.aggregate$amount_purchased_shop_2
shop3 <- cluster.slice.kmeans.aggregate$amount_purchased_shop_3
shop4 <- cluster.slice.kmeans.aggregate$amount_purchased_shop_4
shop5 <- cluster.slice.kmeans.aggregate$amount_purchased_shop_5
# Plot a bar graph of the mean amount spent per shop within each cluster
Legends <- c(rep("Customers Shop 1", 5), rep("Customers Shop 2", 5), rep("Customers Shop 3", 5),
             rep("Customers Shop 4", 5), rep("Customers Shop 5", 5))
values <- c(shop1, shop2, shop3, shop4, shop5)
mydata <- data.frame(cluster = rep(cluster, 5), Legends, values)
p <- ggplot(mydata, aes(cluster, values))
p + geom_bar(stat = "identity", aes(fill = Legends)) +
  xlab("Cluster") + ylab("Total") +
  ggtitle("Customer Segmentation") +
  theme_bw() + scale_y_continuous(labels = scales::comma)
Observations: Clustering the data based on the purchase behaviour of customers, i.e. the shops at which they shop the most, revealed 5 separable clusters to analyze. Cluster analysis identified the customers in each cluster by their customer IDs, which helps in understanding the customer base that each cluster comprises. Customer segmentation then identified the customers of the different shops within each segment (cluster), allowing each cluster to be subdivided and given a meaningful interpretation. Thus, the research question is answered by identifying five customer segments based on purchase behaviour and further partitioning these segments by determining the specific customers belonging to the five different shops.
Applications: Detecting clusters can help the business develop a specific strategy for each cluster base. Clustering can also be used to understand purchase behaviour by tracking customers over months and detecting how many move from one cluster to another, which helps the business organize strategies to increase revenue at different shops. Customer segmentation insights can further be utilized to focus marketing efforts on the right customers, e.g. discounts and offers related to a particular shop can be sent only to the customers who usually purchase at that shop, without bothering the customers of other shops. Targeting the right customers with the right deals thus helps to cut down marketing costs, generate more revenue and increase customer satisfaction.
5. Which are the Top 100 customers that are most profitable in terms of revenue generation for each shop?
Algorithms selected: Custom Ranking Algorithm
Features Selected: amount_purchased_shop_1, amount_purchased_shop_2, amount_purchased_shop_3, amount_purchased_shop_4, amount_purchased_shop_5.
Analysis
1) Generate a loyalty score of the customers for each shop
It is assumed that the more a customer spends at a shop, the more loyal they are to that shop. Based on this assumption, the loyalty score is defined as the contribution of each customer towards the revenue of a shop, i.e. the ratio of the amount spent by the customer at a shop to the total revenue generated by that shop.
library(tidyverse)
library(ggplot2)
library("RColorBrewer")
library(plotly)
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
supermarket_data_clean <- read_csv(file_path)
## Calculate the revenue generated for each shop
revenue_shop_1 <- sum(supermarket_data_clean$amount_purchased_shop_1)
revenue_shop_2 <- sum(supermarket_data_clean$amount_purchased_shop_2)
revenue_shop_3 <- sum(supermarket_data_clean$amount_purchased_shop_3)
revenue_shop_4 <- sum(supermarket_data_clean$amount_purchased_shop_4)
revenue_shop_5 <- sum(supermarket_data_clean$amount_purchased_shop_5)
## Calculate the loyalty score of the customers for each shop based on their contribution to the shop's revenue
supermarket_data_clean$loyality_score_shop_1 <- with(supermarket_data_clean, amount_purchased_shop_1/revenue_shop_1 * 100)
supermarket_data_clean$loyality_score_shop_2 <- with(supermarket_data_clean, amount_purchased_shop_2/revenue_shop_2 * 100)
supermarket_data_clean$loyality_score_shop_3 <- with(supermarket_data_clean, amount_purchased_shop_3/revenue_shop_3 * 100)
supermarket_data_clean$loyality_score_shop_4 <- with(supermarket_data_clean, amount_purchased_shop_4/revenue_shop_4 * 100)
supermarket_data_clean$loyality_score_shop_5 <- with(supermarket_data_clean, amount_purchased_shop_5/revenue_shop_5 * 100)
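Because the loyalty score is a percentage share of shop revenue, the scores of all customers at one shop sum to 100, which provides a quick sanity check on the columns computed above. A sketch with a hypothetical three-customer shop:

```r
amount  <- c(50, 30, 20)               # hypothetical amounts spent at one shop
loyalty <- amount / sum(amount) * 100  # percentage contribution to revenue
loyalty       # each customer's share of the shop's revenue
sum(loyalty)  # shares sum to 100
```

On the real data, `sum(supermarket_data_clean$loyality_score_shop_1)` should likewise return 100 (up to floating-point error).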
2) Retrieve list of Top 100 customers for each shop
## Sort the loyalty score for each shop in descending order and generate a list of the top 100 customers per shop
## Note: order() returns row indices, which identify the customers via the corresponding rows of the data frame
shop_1 <- head(order(supermarket_data_clean$loyality_score_shop_1, decreasing = TRUE), 100)
shop_2 <- head(order(supermarket_data_clean$loyality_score_shop_2, decreasing = TRUE), 100)
shop_3 <- head(order(supermarket_data_clean$loyality_score_shop_3, decreasing = TRUE), 100)
shop_4 <- head(order(supermarket_data_clean$loyality_score_shop_4, decreasing = TRUE), 100)
shop_5 <- head(order(supermarket_data_clean$loyality_score_shop_5, decreasing = TRUE), 100)
shop_1
## [1] 44750 11190 33212 123 42269 40684 5725 12696 50455 37418 6744
## [12] 34972 27723 41131 34205 26898 47087 24745 10484 35854 49638 29322
## [23] 5520 19282 5720 33376 4635 30654 34379 43371 17448 43096 21575
## [34] 25669 45960 8109 18592 33326 45012 43094 10093 35385 5545 42703
## [45] 46362 30140 19049 30914 17246 55458 53792 11766 21880 44224 33716
## [56] 11992 17291 11028 47131 22814 49961 10871 44622 15316 7276 42879
## [67] 27010 9909 31624 39152 42621 33796 29832 38239 2928 29163 22788
## [78] 25574 15117 32436 23976 14756 42252 44199 22696 13562 44245 14290
## [89] 2595 15337 41464 41054 15084 35011 7096 6005 11773 46673 2660
## [100] 44157
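Note that `order()` returns row positions rather than customer IDs; whenever the `customer_id` column differs from the row order, the positions must be mapped back through that column. A hedged sketch with a hypothetical `customer_id` column:

```r
df <- data.frame(customer_id = c(901, 902, 903, 904),
                 spend       = c(10, 40, 5, 25))
# Row positions of the two biggest spenders
top_rows <- head(order(df$spend, decreasing = TRUE), 2)
df$customer_id[top_rows]  # map positions back to IDs: 902 904
```

In this dataset the rows happen to be usable as customer identifiers directly, but the mapping step keeps the ranking correct in general.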
3) Visualize the top-ranked 5 customers for each shop
## Create the legend for the plot
Legend <- c(rep("Shop 1", 5), rep("Shop 2", 5), rep("Shop 3", 5), rep("Shop 4", 5), rep("Shop 5", 5))
## Generate the count of columns assigned to each factor
amt_shop_1 <- supermarket_data_clean[head(shop_1, 5), "amount_purchased_shop_1"]
amt_shop_2 <- supermarket_data_clean[head(shop_2, 5), "amount_purchased_shop_2"]
amt_shop_3 <- supermarket_data_clean[head(shop_3, 5), "amount_purchased_shop_3"]
amt_shop_4 <- supermarket_data_clean[head(shop_4, 5), "amount_purchased_shop_4"]
amt_shop_5 <- supermarket_data_clean[head(shop_5, 5), "amount_purchased_shop_5"]
## Create a vector of values to be shown in the plot
values <- c(unlist(amt_shop_1), unlist(amt_shop_2), unlist(amt_shop_3), unlist(amt_shop_4), unlist(amt_shop_5))
## Create a vector of labels to be shown in the plot
labels <- c(head(shop_1, 5), head(shop_2, 5), head(shop_3, 5), head(shop_4, 5), head(shop_5, 5))
labels_factor <- factor(labels, levels = labels)
## Create a data frame to store the vectors
top_cust <- data.frame(labels_factor, values)
## Generate a plot
tc_plot <- ggplot(top_cust, aes(labels_factor, values))
## Add features to the plot
tc_plot + geom_bar(stat = "identity", aes(fill = Legend), width = 0.6) +
xlab("Customer ID") + ylab("Amount spent") +
ggtitle("Top 5 Customers for each Shop") +
theme_bw() + scale_y_continuous(labels = scales::comma) +
theme(axis.text.x=element_text(angle=90, hjust=1))
Observations: The Top 100 customers who spend the most money at each of the 5 shops were ranked and determined based on their loyalty scores. Thus, the research question is answered by retrieving and ranking the Top 100 most profitable customers for each shop.
Applications: These insights can be utilized by the business to identify the customers who contribute the most to the revenue of the respective shops. This can further help formulate reward schemes for them and in turn retain the high-value customer base. Eventually, well-targeted loyalty programs can turn satisfied loyal customers into advocates for the business.
GitHub Repository
https://github.com/Rspawar/Data-Science-with-R.git
Project Website
https://sites.google.com/view/customerbehaviouralanalytics
Project Screencast
Project Presentation
Complete Analysis Report as RMarkdown file
Complete Analysis Report as HTML file
Complete Analysis Report as HTML file rendered in the browser
January 15, 2019
Ding, Chris, and Xiaofeng He. 2004. “K-Means Clustering via Principal Component Analysis.” In Proceedings of the Twenty-First International Conference on Machine Learning, 29. ACM.
Hartigan, John A, and Manchek A Wong. 1979. “Algorithm as 136: A K-Means Clustering Algorithm.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 28 (1). JSTOR: 100–108.
Hsu, Chih-Wei, Chih-Chung Chang, Chih-Jen Lin, and others. 2003. “A Practical Guide to Support Vector Classification.” Taipei.
John, George H, and Pat Langley. 1995. “Estimating Continuous Distributions in Bayesian Classifiers.” In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 338–45. Morgan Kaufmann Publishers Inc.
Liaw, Andy, Matthew Wiener, and others. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22.
Pennacchioli, Diego, Michele Coscia, Salvatore Rinzivillo, Dino Pedreschi, and Fosca Giannotti. 2013. “Explaining the Product Range Effect in Purchase Data.” In 2013 IEEE International Conference on Big Data, 648–56. IEEE.
Safavian, S Rasoul, and David Landgrebe. 1991. “A Survey of Decision Tree Classifier Methodology.” IEEE Transactions on Systems, Man, and Cybernetics 21 (3). IEEE: 660–74.
Soucy, Pascal, and Guy W Mineau. 2001. “A Simple KNN Algorithm for Text Categorization.” In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), 647–48. IEEE.